The data we are working with for this project examines public records that report the number of arrests that took place during the years 2011 to 2015 at National Football League stadiums. There are 5 numerical variables (season, week_number, home_score, away_score, arrests) and 6 categorical variables (day_of_week, gametime_local, home_team, away_team, OT_flag, division_game). We will examine the relationship between the game times, outcomes, and time of year of an NFL game all relative to the number of arrests at the stadium during the game. First sift through the data to find any possible confounding variables. Our hypothesis is there is a positive correlation between tightly contested games such as in-division, close margin of victory, or overtime contests. To test this hypothesis, we will graph arrests vs margin of victory for the whole NFL and then zoom in on certain teams, to show that there is typically a negative correlation.
Our project could provide valuable insight into the public safety ramifications of attending an NFL game. While we hope to show that there is a relationship between the margin of victory and the safety of the game, this project could also open the doors to others who wish to investigate the root of this relationship.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(infer)
arrests <- read.csv("arrests.csv")
dim(arrests)
## [1] 1006 11
NFLarrests <- arrests %>%
mutate(margin_of_vic = home_score - away_score)
Here we have created a new variable that quantifies how close the final score of a game was. We hope to see a relationship between games that are closer (typically more exciting) and the number of arrests at the game.
ggplot(NFLarrests, aes(y = arrests, x = season)) +
geom_point(alpha = 0.5)
## Warning: Removed 40 rows containing missing values (geom_point).
The number of arrests seems to remain relatively constant across all seasons. This could be indicative of NFL stadiums not attempting to increase the safety of stadiums. A high number of arrests could also mean that the safety is taken more seriously and they are willing to arrest people who are being unsafe.
ggplot(NFLarrests, aes(y = arrests, x = home_team)) +
geom_point(alpha = 0.5) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 40 rows containing missing values (geom_point).
This plots the home team versus the number of arrests made. There does appear to be a few stadiums that average more arrests than the others. Of these include New York (this holds for both NY teams, so possibly indicative of stadium conditions and not the teams), Oakland, Pittsburgh and San Diego.
ggplot(NFLarrests, aes(y = arrests, x = away_team)) +
geom_point(alpha = 0.5) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 40 rows containing missing values (geom_point).
The away team was also compared to the number of arrests in order to show that the correlation made between arrests and home team was not random. Because the away team appears to be irrelevant to the number of arrests, we can start to conclude that the number of arrests can partly depend on what stadium we are in are not the teams playing.
ggplot(NFLarrests, aes(y = arrests, x = margin_of_vic, color = home_team)) +
geom_point(alpha = 0.5)
## Warning: Removed 40 rows containing missing values (geom_point).
This compares the margin of victory with the home team. This seems to indicate that the the smaller the margin of victory, the more arrests we see. It also is consistent that we see the teams we noted above as having the most arrests.
ggplot(NFLarrests, aes(y = arrests, x = margin_of_vic, color = away_team)) +
geom_point(alpha = 0.5)
## Warning: Removed 40 rows containing missing values (geom_point).
To further visualize the previous suggestions, we compared the margin of victory to arrests while color coordinating away teams. No pattern was seen between away team and number of arrests but we continue to see the trend of a low margin of victory being associated with high arrests.
ggplot(NFLarrests, aes(y = arrests, x = margin_of_vic, color = OT_flag)) +
geom_point(alpha = 0.5)
## Warning: Removed 40 rows containing missing values (geom_point).
Overtime games typically have much smaller margins of victory. First we need to see how many there are and what the variability in this relationship looks like. While there are several overtime games that show high arrest numbers, they do not seem more prone to arrests than other games with low margin of victory.
ggplot(NFLarrests, aes(y = arrests, x = week_num, color = home_team)) +
geom_point(alpha = 0.5)
## Warning: Removed 40 rows containing missing values (geom_point).
Here we compare the number of arrests with the week number that the game took place in. No discernible effect by the week number can be seen here although the teams with higher arrests noted before can be distinguished.
ggplot(NFLarrests, aes(y = arrests, x = day_of_week, color = home_team)) +
geom_point(alpha = 0.5)
## Warning: Removed 40 rows containing missing values (geom_point).
Here we have plotted the day of the week the game took place on against arrests. While Sundays appear to have many arrests, most games typically happen on Sundays. It does not appear that the day of the week has a noticeable effect on the number of arrests.
ggplot(NFLarrests, aes(y = arrests, x = gametime_local, color = home_team)) +
geom_point(alpha = 0.5) + theme(axis.text.x = element_text(angle = 60, hjust = 1))
## Warning: Removed 40 rows containing missing values (geom_point).
When looking at the local game time, it does seem intially like there is a pattern. It appears there are three big spikes in the game times that have more arrests. This is because NFL games are scheduled in EST, and the teams we noted above as higher risk are mostly in EST so their games all align and create these spikes. The dips are teams who are not in EST and also do not have high number of arrests.
ggplot(NFLarrests, aes(x = arrests, fill = division_game)) +
geom_histogram() +
facet_wrap(~ division_game)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 40 rows containing non-finite values (stat_bin).
Here we compare the number of arrests base on if the game was a division game or not. These charts seem to have similar distributions and the main difference is that division games occur at a lower frequency than non division games.
ggplot(NFLarrests,
aes(y = arrests, x = margin_of_vic, color = margin_of_vic > 0)) +
geom_point(alpha = 0.5)
## Warning: Removed 40 rows containing missing values (geom_point).
This plots the arrests against the margin of victory. It also distinguishes between colors such that home teams lost where the dots are red and won when dots are blue. Based of the plot, there seems to be a correlation between arrests and how close the game was as the data points increase towards 0. In addition, the variability of number of arrests seems to increase the closer you get to zero. Further investigation of both the mean and variability at different margin of victories should be done to see if there is a correlation.
NFLarrests <- NFLarrests %>%
mutate(abs_margin_of_vic = abs(margin_of_vic))
We mutate here to observe the absolute value of the margin of victory. The data seems centered at the mean and we are more interested in how close the game was than who won.
ggplot(NFLarrests,
aes(y = arrests, x = abs_margin_of_vic, color = margin_of_vic > 0)) +
geom_point(alpha = 0.5)
## Warning: Removed 40 rows containing missing values (geom_point).
From visual analysis, it seems that most of the highest arrest games also coincide with the lowest margin of victory games. This could be indicative of a relationship between margin of victory and arrests.
model_nfl <- lm(arrests ~ abs_margin_of_vic, data = NFLarrests)
summary(model_nfl)
##
## Call:
## lm(formula = arrests ~ abs_margin_of_vic, data = NFLarrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.156 -5.884 -3.906 1.157 62.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.21037 0.49405 14.59 <2e-16 ***
## abs_margin_of_vic -0.05437 0.03275 -1.66 0.0973 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.503 on 964 degrees of freedom
## (40 observations deleted due to missingness)
## Multiple R-squared: 0.00285, Adjusted R-squared: 0.001815
## F-statistic: 2.755 on 1 and 964 DF, p-value: 0.09728
confint(model_nfl)
## 2.5 % 97.5 %
## (Intercept) 6.2408242 8.179912926
## abs_margin_of_vic -0.1186433 0.009911778
Using the overall Nfl arrest data, we are 95% confident that there is a slight negative correlation between the margin of victory in an NFL game and the arrests during the game. However, by looking at the chart above, it is unclear if this is biased by outliers like New York or if it holds for all teams.
ggplot(NFLarrests,
aes(y = arrests, x = abs_margin_of_vic, color = margin_of_vic > 0)) +
geom_point(alpha = 0.5) + geom_abline(slope = -.05437, intercept = 7.21037, color = "red")
## Warning: Removed 40 rows containing missing values (geom_point).
Looking at the home_team vs. arrests chart, the NY Giants, NY Jets, Oakland (now Las Vegas) Raiders, and Pittsburgh Steelers all have floor numbers of arrests that are greater than zero, so we are going to isolate each team’s home game data to see their individual margin of victory versus arrest relationships.
NFLarrests <- NFLarrests %>%
mutate(margin_of_vic = abs(margin_of_vic))
Here we change the variable margin of victory into it’s absolute value form. This was done because we wanted to create a linear model of all positive values. It doesn’t jeopardize the data because the purpose is to examine how close games affect the number of arrests. In other words, if people in a more excited state tend to be arrested more.
OakArrests <- NFLarrests %>%
filter(home_team == "Oakland")
Here we have filtered the data to just games that occured in Oakland.
model_oak <- lm(arrests ~ margin_of_vic, data = OakArrests)
summary(model_oak)
##
## Call:
## lm(formula = arrests ~ margin_of_vic, data = OakArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5799 -4.0836 -0.8809 2.3202 31.2189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.87948 2.32662 8.115 1.48e-09 ***
## margin_of_vic -0.09985 0.17718 -0.564 0.577
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.774 on 35 degrees of freedom
## Multiple R-squared: 0.008993, Adjusted R-squared: -0.01932
## F-statistic: 0.3176 on 1 and 35 DF, p-value: 0.5766
confint(model_oak)
## 2.5 % 97.5 %
## (Intercept) 14.1561810 23.6027766
## margin_of_vic -0.4595502 0.2598422
From the summary, we extract the values of the slope and interval, in order to create a best fit line. In addition, the 95% confidence interval for the Oakland Raiders is [-0.4595502 0.2598422]. Because this interval contains positive and negative values, we can not establish a relatively confident relationship between the values.
ggplot(OakArrests, aes(y = arrests, x = abs_margin_of_vic)) +
geom_point() +
geom_abline(slope = -0.09985,
intercept = 18.87948,
color = "red")
There is no signifigant trend between the absolute margin of victory and arrests in Oakland.
PittArrests <- NFLarrests %>%
filter(home_team == "Pittsburgh")
Here we have filtered the data to just games that occured in Pittsburgh.
ggplot(PittArrests, aes(y = arrests, x = margin_of_vic)) +
geom_point() +
geom_abline(slope = -0.1948,
intercept = 18.9663,
color = "red")
model_pitt <- lm(arrests ~ margin_of_vic, data = PittArrests)
summary(model_pitt)
##
## Call:
## lm(formula = arrests ~ margin_of_vic, data = PittArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.382 -9.206 -3.044 6.414 40.346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.9663 3.3873 5.599 2.02e-06 ***
## margin_of_vic -0.1948 0.2400 -0.812 0.422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.68 on 38 degrees of freedom
## Multiple R-squared: 0.01705, Adjusted R-squared: -0.008818
## F-statistic: 0.6591 on 1 and 38 DF, p-value: 0.4219
confint(model_pitt)
## 2.5 % 97.5 %
## (Intercept) 12.1089440 25.8235749
## margin_of_vic -0.6806636 0.2909916
The 95% confidence interval for the Pittsburg Steelers is [-0.6806636 0.2909916] Because this interval contains positive and negative values, we can not establish a relatively confident relationship between the values. Created a visualization of this team’s margin of victory versus arrest relationship.
NYGiants_arrests <- NFLarrests %>%
filter(home_team == "New York Giants")
head(NYGiants_arrests)
## season week_num day_of_week gametime_local home_team away_team
## 1 2011 2 Monday 8:30:00 PM New York Giants St. Louis
## 2 2011 5 Sunday 1:00:00 PM New York Giants Seattle
## 3 2011 6 Sunday 1:00:00 PM New York Giants Buffalo
## 4 2011 8 Sunday 1:00:00 PM New York Giants Miami
## 5 2011 11 Sunday 8:20:00 PM New York Giants Philadelphia
## 6 2011 13 Sunday 4:15:00 PM New York Giants Green Bay
## home_score away_score OT_flag arrests division_game margin_of_vic
## 1 28 16 18 n 12
## 2 25 36 15 n 11
## 3 27 24 19 n 3
## 4 20 17 23 n 3
## 5 10 17 26 y 7
## 6 35 38 35 n 3
## abs_margin_of_vic
## 1 12
## 2 11
## 3 3
## 4 3
## 5 7
## 6 3
Here we have filtered the data to just games that occurred in the New York stadium with the Giants as the home team.
Giants_model <- lm(arrests ~ margin_of_vic, data = NYGiants_arrests)
summary(Giants_model)
##
## Call:
## lm(formula = arrests ~ margin_of_vic, data = NYGiants_arrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8386 -6.2806 0.1715 4.8103 20.5452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6669 2.2639 11.779 2.98e-14 ***
## margin_of_vic -0.4040 0.1759 -2.297 0.0272 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.472 on 38 degrees of freedom
## Multiple R-squared: 0.1219, Adjusted R-squared: 0.0988
## F-statistic: 5.276 on 1 and 38 DF, p-value: 0.02723
confint(Giants_model)
## 2.5 % 97.5 %
## (Intercept) 22.0839331 31.24993921
## margin_of_vic -0.7601545 -0.04792955
Using a linear regression model, we compared the margin of victory to arrests in the New York Giants data set. From the summary, we can see the residuals, intercepts and slopes that we can use for creating a line that estimates the data. The 95% confidence interval for the slope of the relationship between the margin_of_victory and number of arrests at NY Giants home games is [-0.7601545, -0.04792955]. This indicates that though we do not know the true correlation coefficient between these two variables, we are 95% confident that the true value lies between -0.7601545 and -0.04792955. The fact that the confidence interval does not contain any positive values means that we can conclude with relative confidence that the greater the margin of victory, the fewer arrests will be made.
ggplot(NYGiants_arrests, aes(x = margin_of_vic, y = arrests, color = OT_flag)) +
geom_point() +
geom_abline(slope = -0.4040,
intercept = 26.6669,
color = "red")
This provides a linear regression model that shows the general trend between the margin of victory and the number of arrests at NY Giants home games.
NYJets_arrests <- NFLarrests %>%
filter(home_team == "New York Jets")
head(NYJets_arrests)
## season week_num day_of_week gametime_local home_team away_team
## 1 2011 1 Sunday 8:20:00 PM New York Jets Dallas
## 2 2011 2 Sunday 1:00:00 PM New York Jets Jacksonville
## 3 2011 6 Monday 8:30:00 PM New York Jets Miami
## 4 2011 7 Sunday 1:00:00 PM New York Jets San Diego
## 5 2011 10 Sunday 8:20:00 PM New York Jets New England
## 6 2011 12 Sunday 1:00:00 PM New York Jets Buffalo
## home_score away_score OT_flag arrests division_game margin_of_vic
## 1 27 24 34 n 3
## 2 32 3 23 n 29
## 3 24 6 28 y 18
## 4 27 21 20 n 6
## 5 16 37 30 y 21
## 6 28 24 12 y 4
## abs_margin_of_vic
## 1 3
## 2 29
## 3 18
## 4 6
## 5 21
## 6 4
Jets_model <- lm(arrests ~ margin_of_vic, data = NYJets_arrests)
summary(Jets_model)
##
## Call:
## lm(formula = arrests ~ margin_of_vic, data = NYJets_arrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8846 -6.3692 -0.3916 6.3230 22.9457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.79486 2.39397 8.686 1.47e-10 ***
## margin_of_vic 0.05189 0.15462 0.336 0.739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 38 degrees of freedom
## Multiple R-squared: 0.002955, Adjusted R-squared: -0.02328
## F-statistic: 0.1126 on 1 and 38 DF, p-value: 0.739
confint(Jets_model)
## 2.5 % 97.5 %
## (Intercept) 15.9485088 25.6412061
## margin_of_vic -0.2611174 0.3649023
Replicating our work for the New York Giants, we collected the values of residuals, slope and intercepts, and continue by constructing a 95% confidence interval for the slope of the relationship between the margin_of_victory and number of arrests at NY Jets home. This value lies between [-0.2611174, 0.3649023]. This indicates that though we do not know the true correlation coefficient between these two variables, we are 95% confident that the true value lies between these two values.
ggplot(NYJets_arrests, aes(x = margin_of_vic, y = arrests)) +
geom_point() +
geom_abline(slope = 0.05189,
intercept = 20.79486,
color = "red")
Here we have created a visualization of the Jet’s margin of victory versus arrest relationship.
OTGames_arrests <- NFLarrests %>%
filter(OT_flag != "")
head(OTGames_arrests)
## season week_num day_of_week gametime_local home_team away_team home_score
## 1 2011 9 Sunday 2:15:00 PM Arizona St. Louis 19
## 2 2011 13 Sunday 2:15:00 PM Arizona Dallas 19
## 3 2011 15 Sunday 2:15:00 PM Arizona Cleveland 20
## 4 2011 17 Sunday 2:15:00 PM Arizona Seattle 23
## 5 2012 4 Sunday 1:05:00 PM Arizona Miami 24
## 6 2012 6 Sunday 1:05:00 PM Arizona Buffalo 19
## away_score OT_flag arrests division_game margin_of_vic abs_margin_of_vic
## 1 13 OT 6 y 6 6
## 2 13 OT 3 n 6 6
## 3 17 OT 1 n 3 3
## 4 20 OT 4 y 3 3
## 5 21 OT 4 n 3 3
## 6 16 OT 1 n 3 3
Here we filtered the data set to isolated games that went to overtime.
ggplot(OTGames_arrests, aes(x = margin_of_vic, y = arrests)) +
geom_point()
## Warning: Removed 3 rows containing missing values (geom_point).
The construction of a plot to visualize the Overtime data.
dangerZone <- NFLarrests %>%
filter(home_team == "Oakland" | home_team == "Pittsburgh" | home_team == "New York Giants" | home_team == "New York Jets")
t.test(x=dangerZone$arrests, mu=6.566)
##
## One Sample t-test
##
## data: dangerZone$arrests
## t = 16.553, df = 156, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 6.566
## 95 percent confidence interval:
## 18.08837 21.21099
## sample estimates:
## mean of x
## 19.64968
In this filtered data set, we selected a few stadiums that appeared to have statistically high arrest values compared to the rest of the NFL. We subsequently took a t-test with 156 degrees of freedom to examine whether the mean of this data set was higher, and thus quantify whether certain stadiums were more prone to arrests. Here we can see that there is a significant difference between the mean number of arrests at Oakland, Pittsburgh, New York Giants, and New York Jets games over the mean number of arrests at NFL games. This indicates to us that these games are more dangerous than the others.
low_mov <- NFLarrests %>%
filter(margin_of_vic <= 7)
high_mov <- NFLarrests %>%
filter(margin_of_vic > 7)
var.test(low_mov$arrests, high_mov$arrests)
##
## F test to compare two variances
##
## data: low_mov$arrests and high_mov$arrests
## F = 1.2239, num df = 447, denom df = 517, p-value = 0.02667
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.023641 1.464969
## sample estimates:
## ratio of variances
## 1.223859
After the initial examination their appeared to be a higher variance in the number of arrests when the margin of victory approached zero. In order to examine the significance, the data set was filtered into to two separate data sets. One where the margin of victory was less than or equal 7, and another that was greater than 7(value of a touchdown).We then conducted an F-statistic test between the corresponding variances. Our 95% confidence interval gave us a ratio of variance between [1.023641, 1.464969] with a ratio of variances 1.223859. Because our smaller margin of victories was in the numerator, this ratio does in fact reaffirm our prediction that the variance is greater in lower margin of victoriy games.
When examining the data, we found less trends than we initially expected. It seems most teams had a relatively low amount of arrests. The most significant trend we found was a slight negative correlation between the absolute value of the margin of victory and the number of arrests at the New York Giants home games. Though when comparing the data between the New York Giants and the New York Jets an interesting difference was found. While both of these teams play their home games at MetLife Stadium, the Giants experienced more arrests with lower margin of victory, yet the Jets did not show the same trend. This could be caused by a number of factors, such as differences in fan base or that the Jets have had significantly less success than the Giants in recent years.
Additionally, when we compared the number of arrests of Oakland, Pittsburgh, and both New York teams to the NFL average, they were significantly different. When running a one sample t-test we found a p < 2.2e-16, showing that these cities have higher arrest numbers than the NFL on average.
One final find of ours was the difference in variability in arrests between games with a high margin of victory (>7) and games with a low margin of victory (<=7). A significant difference was found between these two variances, leading us to believe that the crowds at games with low margin of victory are more volatile and possibly more dangerous.
In totality, NFL stadiums appear to be relatively safe and the fact that they are making arrests, suggests that teams do have interest in public safety. Future research could look into the relationship between the success of a team and the number of arrests or possibly explore more data from the cities in which these teams are located.